project subject :
Building different forecasting models for forecasting is one of the variables of dataset.
The project is coded with R
project based on Crisp_dm model.

CRISP_DM Model

Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
CRISP-DM breaks the process of data mining into six major phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment
    Crisp_Dm Model

1.Problem Understanding & 2.Data Understanding

These two steps overlap, so let’s check together.

Questions to understand the problem

  1. What is the main motivation of such a project?
  2. What might the output of such a project be used for?
  3. Who might be interested in the results of this project? Why?

Answer:This project is to predict the number of applications received by various American universities. The output of this project can be used to separate different universities from each other, these results can be attractive for the Ministry of Education. Popular universities can be identified to give them more funding or to plan for other universities to reach the level of top universities, and it can be attractive for students to choose a university that has a chance of being accepted.

Data

College : U.S. News and World Report’s College Data
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.https://rdrr.io/cran/ISLR/man/College.html

college_data <- read.csv("college.csv")
head(college_data)
##                              X Private Apps Accept Enroll Top10perc Top25perc
## 1 Abilene Christian University     Yes 1660   1232    721        23        52
## 2           Adelphi University     Yes 2186   1924    512        16        29
## 3               Adrian College     Yes 1428   1097    336        22        50
## 4          Agnes Scott College     Yes  417    349    137        60        89
## 5    Alaska Pacific University     Yes  193    146     55        16        44
## 6            Albertson College     Yes  587    479    158        38        62
##   F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1        2885         537     7440       3300   450     2200  70       78
## 2        2683        1227    12280       6450   750     1500  29       30
## 3        1036          99    11250       3750   400     1165  53       66
## 4         510          63    12960       5450   450      875  92       97
## 5         249         869     7560       4120   800     1500  76       72
## 6         678          41    13500       3335   500      675  67       73
##   S.F.Ratio perc.alumni Expend Grad.Rate
## 1      18.1          12   7041        60
## 2      12.2          16  10527        56
## 3      12.9          30   8735        54
## 4       7.7          37  19016        59
## 5      11.9           2  10922        15
## 6       9.4          11   9727        55
# Remove the first column
college_data <- college_data[,-1]
dim(college_data)
## [1] 777  18

A data frame with 777 observations on the following 18 variables.

Dataset Variable Definition

  • Private: A factor with levels No and Yes indicating private or public university.
  • Apps: Number of applications received. (The variable to be predicted)
  • Accept: Number of applications accepted.
  • Enroll: Number of new students enrolled.
  • Top10perc: Pct. new students from top 10% of H.S. class.
  • Top25perc: Pct. new students from top 25% of H.S. class.
  • F.Undergrad: Number of fulltime undergraduates.
  • P.Undergrad: Number of parttime undergraduates.
  • Outstate: Out-of-state tuition.
  • Room.Board: Room and board costs.
  • Books: Estimated book costs.
  • Personal: Estimated personal spending.
  • PhD: Pct.of faculty with Ph.D.’s.
  • Terminal: Pct. of faculty with terminal degree.
  • S.F.Ratio: Student/faculty ratio.
  • perc.alumni: Pct. alumni who donate.
  • Expend: Instructional expenditure per student.
  • Grad.Rate: Graduation rate.

Source

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section’s 1995 Data Analysis Exposition.

Data Understanding from Statistical Perspective

class(college_data)
## [1] "data.frame"
str(college_data)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...

Categorical variable :

  • private

Integer variables :

  • Apps
  • Accept
  • Enroll
  • F.Undergrad
  • P.Undergrad
  • Outstate
  • Room.Board
  • Books
  • Books
  • Personal
  • Expend

Real number :

  • S.F.Ratio

Percentage numbers :

  • Top10perc
  • Top25perc
  • PhD
  • Terminal
  • perc.alumni
  • Grad.Rate

The range of these numbers is between 0 and 100

summary(college_data)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

Note: privet must be convert to categorical. and PhD and Grad.Rate variables have exceeded their limit

Univariate Profiling

1_ Private: A factor with levels No and Yes indicating private or public university.

#convert to factor
college_data$Private <- factor(college_data$Private)
summary(college_data$Private)
##  No Yes 
## 212 565
library("ggplot2")
ggplot(data = college_data, aes(x = Private , y = stat(prop * 100), group = 1)) +
   geom_bar(fill= c("dark blue", "dark red")) +
   ylab("percentage") +
   scale_y_continuous(breaks = seq(0,100,10),limits = c(0,100)) +
   ggtitle("Private university or not")+ 
   theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size    = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))

private number percent
NO 212 27.3%
YES 565 72.7%

Note: We are not symmetrically connected with different universities.

2_ Apps: Number of applications received. (The variable to be predicted)

summary(college_data$Apps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      81     776    1558    3002    3624   48094
#Test of normality
#1.Histogram
hist(college_data$Apps, probability = T, breaks = 20,
     main = "Apps hist", xlab = "Apps")
lines(density(college_data$Apps), col = "red")

#2.QQplot
qqnorm(college_data$Apps, main = "QQ plot of Apps", pch = 20)
qqline(college_data$Apps, col = "red")

#3.Test Skewness and Kurtosis
library("moments")
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Apps)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Apps
## JB = 24687, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Apps)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Apps
## kurt = 29.595, z = 15.195, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
college_data[college_data$Apps > 22000,]
##     Private  Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 484      No 48094  26330   4520        36        79       21401        3712
##     Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 484     7410       4748   690     2009  90       95      19.5          19
##     Expend Grad.Rate
## 484  10474        77
college_data <- college_data[-which(row.names(college_data)== 484), ]

hist(college_data$Apps, probability = T, breaks = 20,main = "Apps hist", xlab = "Apps", ylim = c(0,0.00025))
lines(density(college_data$Apps), col = "red")

conclusion: reject normality assumption.

boxplot(college_data$Apps, main = "Apps boxplot")

Note: The Apps variable has a lot of Outliers.

3_ Accept: Number of applications accepted.

summary(college_data$Accept)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    72.0   603.2  1109.5  1987.5  2407.5 18744.0
#Test of normality
#1.Histogram
hist(college_data$Accept, probability = T, breaks = 15,
     main = "Accept hist", xlab = "Accept")
lines(density(college_data$Accept), col = "red")

#2.QQplot
qqnorm(college_data$Accept, main = "QQ plot of Accept", pch = 20)
qqline(college_data$Accept, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Accept)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Accept
## JB = 3759.8, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Accept)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Accept
## kurt = 12.359, z = 11.912, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Accept, main = "Accept boxplot")

Note: The Accept variable has a lot of Outliers.

4_ Enroll: Number of new students enrolled.

summary(college_data$Enroll)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0   242.0   434.0   775.2   893.8  6392.0
#Test of normality
#1.Histogram
hist(college_data$Enroll, probability = T, breaks = 15,
     main = "Enroll hist", xlab = "Enroll")
lines(density(college_data$Enroll), col = "red")

#2.QQplot
qqnorm(college_data$Enroll, main = "QQ plot of Enroll", pch = 20)
qqline(college_data$Enroll, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Enroll)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Enroll
## JB = 3539.4, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Enroll)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Enroll
## kurt = 11.963, z = 11.761, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Enroll, main = "Enroll boxplot")

Note: The Enroll variable has a lot of Outliers.

5_ Top10perc: Pct. new students from top 10% of H.S. class.

summary(college_data$Top10perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   23.00   27.55   35.00   96.00
#Test of normality
#1.Histogram
hist(college_data$Top10perc, probability = T, breaks = 15,
     main = "Top10perc hist", xlab = "Top10perc")
lines(density(college_data$Top10perc), col = "red")

#2.QQplot
qqnorm(college_data$Top10perc, main = "QQ plot of Top10perc", pch = 20)
qqline(college_data$Top10perc, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Top10perc)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Top10perc
## JB = 412.33, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Top10perc)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Top10perc
## kurt = 5.1860, z = 6.5872, p-value = 4.482e-11
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Top10perc, main = "Top10perc boxplot")

Note: The Top10perc variable has a lot of Outliers.

6_ Top25perc: Pct. new students from top 25% of H.S. class.

summary(college_data$Top25perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   41.00   54.00   55.77   69.00  100.00
#Test of normality
#1.Histogram
hist(college_data$Top25perc, probability = T, breaks = 15,
     main = "Top25perc hist", xlab = "Top25perc")
lines(density(college_data$Top25perc), col = "red")

#2.QQplot
qqnorm(college_data$Top25perc, main = "QQ plot of Top25perc", pch = 20)
qqline(college_data$Top25perc, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Top25perc)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Top25perc
## JB = 19.135, p-value = 6.995e-05
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Top25perc)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Top25perc
## kurt = 2.4364, z = -4.4374, p-value = 9.103e-06
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Top25perc, main = "Top25perc boxplot")

Note: The Top25perc variable hasn’t Outliers.

7_ F.Undergrad: Number of fulltime undergraduates.

  • Full-time undergraduate students are defined as those taking nine credits or more per academic term.
summary(college_data$F.Undergrad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     139     991    1707    3677    3969   31643
#Test of normality
#1.Histogram
hist(college_data$F.Undergrad, probability = T, breaks = 15,
     main = "F.Undergrad hist", xlab = "F.Undergrad")
lines(density(college_data$F.Undergrad), col = "red")

#2.QQplot
qqnorm(college_data$F.Undergrad, main = "QQ plot of F.Undergrad", pch = 20)
qqline(college_data$F.Undergrad, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$F.Undergrad)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$F.Undergrad
## JB = 2863.3, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$F.Undergrad)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$F.Undergrad
## kurt = 10.814, z = 11.275, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$F.Undergrad, main = "F.Undergrad boxplot")

Note: The F.Undergrad variable has a lot of Outliers.

8_ P.Undergrad: Number of parttime undergraduates.

  • Part-time undergraduate students are defined as those taking fewer than nine credits per academic term.
summary(college_data$P.Undergrad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    95.0   352.5   851.6   964.0 21836.0
#Test of normality
#1.Histogram
hist(college_data$P.Undergrad, probability = T, breaks = 15,
     main = "P.Undergrad hist", xlab = "P.Undergrad")
lines(density(college_data$P.Undergrad), col = "red")

#2.QQplot
qqnorm(college_data$P.Undergrad, main = "QQ plot of P.Undergrad", pch = 20)
qqline(college_data$P.Undergrad, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$P.Undergrad)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$P.Undergrad
## JB = 102622, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$P.Undergrad)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$P.Undergrad
## kurt = 58.165, z = 17.007, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$P.Undergrad, main = "P.Undergrad boxplot")

Note: The P.Undergrad variable has a lot of Outliers.

9_ Outstate: Out-of-state tuition.

  • Out-of-state tuition refers to the rate that students coming from outside the state, including international students, pay to attend a public state school. In-state tuition is typically much cheaper than out-of-state tuition.
summary(college_data$Outstate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2340    7305    9990   10445   12931   21700
#Test of normality
#1.Histogram
hist(college_data$Outstate, probability = T, breaks = 15,
     main = "Outstate hist", xlab = "Outstate")
lines(density(college_data$Outstate), col = "red")

#2.QQplot
qqnorm(college_data$Outstate, main = "QQ plot of Outstate", pch = 20)
qqline(college_data$Outstate, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Outstate)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Outstate
## JB = 38.861, p-value = 3.642e-09
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Outstate)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Outstate
## kurt = 2.5792, z = -2.9480, p-value = 0.003199
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Outstate, main = "Outstate boxplot")

Note: The P.Undergrad variable has a One Outliers.

10_ Room.Board: Room and board costs.

  • Room and board is a phrase describing a situation in which, in exchange for money, labor or other considerations, a person is provided with a place to live as well as meals on a comprehensive basis.
summary(college_data$Room.Board)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1780    3596    4198    4357    5050    8124
#Test of normality
#1.Histogram
hist(college_data$Room.Board, probability = T, breaks = 15,
     main = "Room.Board hist", xlab = "Room.Board")
lines(density(college_data$Room.Board), col = "red")

#2.QQplot
qqnorm(college_data$Room.Board, main = "QQ plot of Room.Board", pch = 20)
qqline(college_data$Room.Board, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Room.Board)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Room.Board
## JB = 30.737, p-value = 2.116e-07
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Room.Board)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Room.Board
## kurt = 2.8041, z = -1.1201, p-value = 0.2627
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Room.Board, main = "Room.Board boxplot")

Note: The Room.Board variable has a lot of Outliers.

11_ Books: Estimated book costs.

summary(college_data$Books)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   469.5   500.0   549.2   600.0  2340.0
#Test of normality
#1.Histogram
hist(college_data$Books, probability = T, breaks = 15,
     main = "Books hist", xlab = "Books", ylim = c(0, 0.005))
lines(density(college_data$Books), col = "red")

#2.QQplot
qqnorm(college_data$Books, main = "QQ plot of Books", pch = 20)
qqline(college_data$Books, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Books)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Books
## JB = 27239, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Books)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Books
## kurt = 31.176, z = 15.344, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Books, main = "Books boxplot")

Note: The Books variable has a lot of Outliers.

12_ Personal: Estimated personal spending.

summary(college_data$Personal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250     850    1200    1340    1692    6800
#Test of normality
#1.Histogram
hist(college_data$Personal, probability = T, breaks = 15,
     main = "Personal hist", xlab = "Personal")
lines(density(college_data$Personal), col = "red")

#2.QQplot
qqnorm(college_data$Personal, main = "QQ plot of Personal", pch = 20)
qqline(college_data$Personal, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Personal)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Personal
## JB = 2018.9, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Personal)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Personal
## kurt = 10.091, z = 10.926, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Personal, main = "Personal boxplot")

Note: The Books variable has a lot of Outliers.

13_ PhD: Pct.of faculty with Ph.D.’s.

  • percentage of faculty with a PhD may be used as a university ratings measure.
summary(college_data$PhD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.64   85.00  103.00

Note: The PhD variables have exceeded their limit.

#Test of normality
#1.Histogram
hist(college_data$PhD, probability = T, breaks = 15,
     main = "PhD hist", xlab = "PhD")
lines(density(college_data$PhD), col = "red")

#2.QQplot
qqnorm(college_data$PhD, main = "QQ plot of PhD", pch = 20)
qqline(college_data$PhD, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$PhD)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$PhD
## JB = 85.651, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$PhD)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$PhD
## kurt = 3.5534, z = 2.6620, p-value = 0.007767
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption. PhD variables have exceeded their limit.

boxplot(college_data$PhD, main = "PhD boxplot")

Note: The Books variable has a lot of Outliers.

college_data <- college_data[-which(college_data$PhD > 100), ]

14_ Terminal: Pct. of faculty with terminal degree.

  • A terminal degree is the term used to describe the highest degree available in any given academic discipline. When a student graduates from their chosen academic program with a terminal degree, it means that they have reached the highest level of education available in their chosen field.
summary(college_data$Terminal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   71.00   82.00   79.67   92.00  100.00
#Test of normality
#1.Histogram
hist(college_data$Terminal, probability = T, breaks = 15,
     main = "Terminal hist", xlab = "Terminal")
lines(density(college_data$Terminal), col = "red")

#2.QQplot
qqnorm(college_data$Terminal, main = "QQ plot of Terminal", pch = 20)
qqline(college_data$Terminal, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Terminal)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Terminal
## JB = 86.756, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Terminal)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Terminal
## kurt = 3.2277, z = 1.3114, p-value = 0.1897
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Terminal, main = "Terminal boxplot")

Note: The Terminal variable has a lot of Outliers.

15_ S.F.Ratio: Student/faculty ratio.

  • Student–teacher ratio or student–faculty ratio is the number of students who attend a school or university divided by the number of teachers in the institution. For example, a student–teacher ratio of 10:1 indicates that there are 10 students for every one teacher.
summary(college_data$S.F.Ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.50   11.50   13.60   14.08   16.45   39.80
#Test of normality
#1.Histogram
hist(college_data$S.F.Ratio, probability = T, breaks = 15,
     main = "S.F.Ratio hist", xlab = "S.F.Ratio")
lines(density(college_data$S.F.Ratio), col = "red")

#2.QQplot
qqnorm(college_data$S.F.Ratio, main = "QQ plot of S.F.Ratio", pch = 20)
qqline(college_data$S.F.Ratio, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$S.F.Ratio)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$S.F.Ratio
## JB = 270.49, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$S.F.Ratio)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$S.F.Ratio
## kurt = 5.5621, z = 7.1525, p-value = 8.521e-13
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$S.F.Ratio, main = "S.F.Ratio boxplot")

Note: The S.F.Ratio variable has a lot of Outliers.

16_ perc.alumni: Pct. alumni who donate.

summary(college_data$perc.alumni)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.00   21.00   22.76   31.00   64.00
#Test of normality
#1.Histogram
hist(college_data$perc.alumni, probability = T, breaks = 15,
     main = "perc.alumni hist", xlab = "perc.alumni")
lines(density(college_data$perc.alumni), col = "red")

#2.QQplot
qqnorm(college_data$perc.alumni, main = "QQ plot of perc.alumni", pch = 20)
qqline(college_data$perc.alumni, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$perc.alumni)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$perc.alumni
## JB = 47.266, p-value = 5.448e-11
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$perc.alumni)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$perc.alumni
## kurt = 2.88878, z = -0.54807, p-value = 0.5836
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$perc.alumni, main = "perc.alumni boxplot")

Note: The S.F.Ratio variable has a few of Outliers.

17_ Expend: Instructional expenditure per student.

  • instructional spending per student is defined as total instructional spending divided by total full-time equivalent student enrollment. Total instructional spending is the amount each institution spends on the units that run its educational programs.
summary(college_data$Expend)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3186    6754    8377    9663   10847   56233
#Test of normality
#1.Histogram
hist(college_data$Expend, probability = T, breaks = 15,
     main = "Expend hist", xlab = "Expend")
lines(density(college_data$Expend), col = "red")

#2.QQplot
qqnorm(college_data$Expend, main = "QQ plot of Expend", pch = 20)
qqline(college_data$Expend, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Expend)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Expend
## JB = 12711, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Expend)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Expend
## kurt = 21.602, z = 14.144, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

conclusion: reject normality assumption.

boxplot(college_data$Expend, main = "Expend boxplot")

Note: The S.F.Ratio variable has a lot of Outliers.

18_ Grad.Rate: Graduation rate.

  • A graduation rate is a measure of how many students who began in the same cohort will graduate in six years for four-year programs or three years for two-year programs. This rate indicates how many students finish their degrees in a timely manner upon enrolling. It can also help prospective students measure the quality of a school, since higher graduation rates may indicate that students’ resources, time and investment in a program at a particular school will likely be worth it.
summary(college_data$Grad.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   53.00   65.00   65.48   78.00  118.00

Note: The Grad.Rate variables have exceeded their limit.

#Test of normality
#1.Histogram
hist(college_data$Grad.Rate, probability = T, breaks = 15,
     main = "Grad.Rate hist", xlab = "Grad.Rate")
lines(density(college_data$Grad.Rate), col = "red")

#2.QQplot
qqnorm(college_data$Grad.Rate, main = "QQ plot of Grad.Rate", pch = 20)
qqline(college_data$Grad.Rate, col = "red")

#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Grad.Rate)
## 
##  Jarque-Bera Normality Test
## 
## data:  college_data$Grad.Rate
## JB = 3.0568, p-value = 0.2169
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Grad.Rate)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  college_data$Grad.Rate
## kurt = 2.7931, z = -1.1975, p-value = 0.2311
## alternative hypothesis: kurtosis is not equal to 3

conclusion: accept normality assumption. The Grad.Rate variable follows the normal distribution. Grad.Rate variables have exceeded their limit.

boxplot(college_data$Grad.Rate, main = "Grad.Rate boxplot")

Note: The S.F.Ratio variable has a few of Outliers.

college_data <- college_data[-which(college_data$Grad.Rate > 100), ]

plot all 17 variables measuring perception

par(mfrow = c(2, 3))
for (i in 2:7) {
   hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
   lines(density(college_data[,i]), col = "red")
}

for (i in 8:13) {
   hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
   lines(density(college_data[,i]), col = "red")
}

for (i in 14:18) {
   hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
   lines(density(college_data[,i]), col = "red")
}
#reset graphical parameters.
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, cex.main = 1)

par(mfrow = c(2, 3), mar = c(2,3, 2, 1.5) )
for (i in 2:7) {
   boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}

for (i in 8:13) {
   boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}

for (i in 14:18) {
   boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, cex.main = 1)

Bivariate Profiling

Categorical vs. Numerical Variables:

Private vs. Apps
tapply(college_data$Apps,college_data$Private,mean)
##       No      Yes 
## 5552.952 1974.615

Conclusion: Private universities received more applications.

#Histogram for all data in one plot
ggplot(college_data, aes(x = Apps, fill = Private)) +
     geom_histogram(bins = 25, alpha = 0.5, position = "dodge") +
     ggtitle("Distribute the Apps variable based on the Private variable") + 
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color     = "black"),axis.line.y = element_line(size = 1, color = "black"))

#Histogram for all data in two plot
 ggplot(college_data, aes(x = Apps)) +
     geom_histogram(bins = 25, fill = "light blue") +
     facet_grid(Private ~ .) + 
    ggtitle("Distribute the Apps variable based on the Private variable") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color     = "black"),axis.line.y = element_line(size = 1, color = "black"))

 #Box plot 
 boxplot(Apps ~ Private, data = college_data, names = levels(college_data$Private), main = "box plot Apps based on the Private")

conclusion: The variable distribution of the Apps varies in different Privet factors

#independent 2_Group mann_whiteny U Test
 wilcox.test(Apps ~ Private, data = college_data, alternative = "two.sided", paired = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Apps by Private
## W = 95738, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Note: p_value < 0.05. Indicates that Privat variable has an effect on the Apps variable

Continuous Variable

Accept vs. Apps
ggplot(college_data, aes(x = Accept, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Accept vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Accept, method = "pearson")
## [1] 0.9360829
#Normality assumption for pearson correlation
#Hypothesis test: pearson correlation
# H0 : correlation pearson = 0
# H1 : correlation pearson != 0
# p_value < 0.05 reject H0
cor.test(college_data$Apps, college_data$Accept, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Accept
## S = 1600508, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9792897

conclusion: Apps and Accept have high correlation.

Enroll vs. Apps
ggplot(college_data, aes(x = Enroll, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Enroll vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Enroll, method = "pearson")
## [1] 0.8750872
cor.test(college_data$Apps, college_data$Enroll, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Enroll
## S = 5714525, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9260549

conclusion: Apps and Enroll have high correlation.

Top10perc vs. Apps

ggplot(college_data, aes(x = Top10perc, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Top10perc vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Top10perc, method = "pearson")
## [1] 0.3656929
cor.test(college_data$Apps, college_data$Top10perc, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Top10perc
## S = 53509557, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3075946

conclusion: Apps and Top10perc have low correlation.

Top25perc vs. Apps
ggplot(college_data, aes(x = Top25perc, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Top25perc vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Top25perc, method = "pearson")
## [1] 0.3685038
cor.test(college_data$Apps, college_data$Top25perc, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Top25perc
## S = 47691767, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3828759

conclusion: Apps and Top25perc have low correlation.

F.Undergrad vs. Apps
ggplot(college_data, aes(x = F.Undergrad, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("F.Undergrad vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$F.Undergrad, method = "pearson")
## [1] 0.8440157
cor.test(college_data$Apps, college_data$F.Undergrad, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$F.Undergrad
## S = 8761954, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8866217

conclusion: Apps and F.Undergrad have high correlation.

P.Undergrad vs. Apps
ggplot(college_data, aes(x = P.Undergrad, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("P.Undergrad vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$P.Undergrad, method = "pearson")
## [1] 0.4084307
cor.test(college_data$Apps, college_data$P.Undergrad, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$P.Undergrad
## S = 46579830, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3972642

conclusion: Apps and P.Undergrad have low correlation.

Outstate vs. Apps
ggplot(college_data, aes(x = Outstate, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Outstate vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Outstate, method = "pearson")
## [1] 0.06668958
cor.test(college_data$Apps, college_data$Outstate, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Outstate
## S = 73412241, p-value = 0.1642
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.05005693

conclusion: There is almost no correlation between Apps and Outstate.

Room.Board vs. Apps
ggplot(college_data, aes(x = Room.Board, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Room.Board vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Room.Board, method = "pearson")
## [1] 0.1748123
cor.test(college_data$Apps, college_data$Room.Board, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Room.Board
## S = 62758050, p-value = 1.389e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1879205

conclusion: Apps and Room.Board have low correlation.

Books vs. Apps
ggplot(college_data, aes(x = Books, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Books vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Books, method = "pearson")
## [1] 0.1321529
cor.test(college_data$Apps, college_data$Books, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Books
## S = 59497032, p-value = 9.242e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2301176

conclusion: Apps and Books have low correlation.

Personal vs. Apps
ggplot(college_data, aes(x = Personal, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Personal vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Personal, method = "pearson")
## [1] 0.1804392
cor.test(college_data$Apps, college_data$Personal, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Personal
## S = 62413139, p-value = 6.887e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1923836

conclusion: Apps and Personal have low correlation.

PhD vs. Apps
ggplot(college_data, aes(x = PhD, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("PhD vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$PhD, method = "pearson")
## [1] 0.4192789
cor.test(college_data$Apps, college_data$PhD, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$PhD
## S = 35269517, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5436179

conclusion: Apps and PhD have low correlation.

Terminal vs. Apps
ggplot(college_data, aes(x = Terminal, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Terminal vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Terminal, method = "pearson")
## [1] 0.3926116
cor.test(college_data$Apps, college_data$Terminal, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Terminal
## S = 38550829, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5011582

conclusion: Apps and Terminal have low correlation.

S.F.Ratio vs. Apps
ggplot(college_data, aes(x = S.F.Ratio, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("S.F.Ratio vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$S.F.Ratio, method = "pearson")
## [1] 0.08356583
cor.test(college_data$Apps, college_data$S.F.Ratio, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$S.F.Ratio
## S = 60976579, p-value = 3.089e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2109725

conclusion: There is almost no correlation between Apps and S.F.Ratio.

perc.alumni vs. Apps
ggplot(college_data, aes(x = perc.alumni, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("perc.alumni vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$perc.alumni, method = "pearson")
## [1] -0.09481275
cor.test(college_data$Apps, college_data$perc.alumni, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$perc.alumni
## S = 83696416, p-value = 0.02089
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.0830187

conclusion: There is almost no correlation between Apps and perc.alumni.

Expend vs. Apps
ggplot(college_data, aes(x = Expend, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Expend vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Expend, method = "pearson")
## [1] 0.283022
cor.test(college_data$Apps, college_data$Expend, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Expend
## S = 62384116, p-value = 6.487e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1927592

conclusion: Apps and Expend have low correlation.

Grad.Rate vs. Apps
ggplot(college_data, aes(x = Grad.Rate, y = Apps)) +
   geom_point() +
   geom_smooth(method = "loess", color = "red") +
   ggtitle("Grad.Rate vs. Apps") +
    theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'

#Correlation check
cor(college_data$Apps, college_data$Grad.Rate, method = "pearson")
## [1] 0.1494663
cor.test(college_data$Apps, college_data$Grad.Rate, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  college_data$Apps and college_data$Grad.Rate
## S = 62708263, p-value = 1.257e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1885648

conclusion: Apps and Grad.Rate have low correlation.

cor_table <- round(cor(college_data[,2:18]),2)
cor_table
##              Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## Apps         1.00   0.94   0.88      0.37      0.37        0.84        0.41
## Accept       0.94   1.00   0.93      0.20      0.25        0.89        0.45
## Enroll       0.88   0.93   1.00      0.18      0.22        0.96        0.51
## Top10perc    0.37   0.20   0.18      1.00      0.89        0.14       -0.11
## Top25perc    0.37   0.25   0.22      0.89      1.00        0.19       -0.06
## F.Undergrad  0.84   0.89   0.96      0.14      0.19        1.00        0.57
## P.Undergrad  0.41   0.45   0.51     -0.11     -0.06        0.57        1.00
## Outstate     0.07  -0.02  -0.15      0.56      0.49       -0.22       -0.25
## Room.Board   0.17   0.09  -0.04      0.37      0.33       -0.07       -0.06
## Books        0.13   0.11   0.11      0.12      0.12        0.11        0.08
## Personal     0.18   0.20   0.28     -0.10     -0.08        0.31        0.32
## PhD          0.42   0.37   0.33      0.53      0.55        0.32        0.15
## Terminal     0.39   0.35   0.31      0.49      0.52        0.30        0.14
## S.F.Ratio    0.08   0.17   0.23     -0.39     -0.30        0.28        0.23
## perc.alumni -0.09  -0.17  -0.18      0.46      0.42       -0.23       -0.28
## Expend       0.28   0.13   0.06      0.66      0.53        0.02       -0.08
## Grad.Rate    0.15   0.06  -0.03      0.50      0.48       -0.08       -0.26
##             Outstate Room.Board Books Personal   PhD Terminal S.F.Ratio
## Apps            0.07       0.17  0.13     0.18  0.42     0.39      0.08
## Accept         -0.02       0.09  0.11     0.20  0.37     0.35      0.17
## Enroll         -0.15      -0.04  0.11     0.28  0.33     0.31      0.23
## Top10perc       0.56       0.37  0.12    -0.10  0.53     0.49     -0.39
## Top25perc       0.49       0.33  0.12    -0.08  0.55     0.52     -0.30
## F.Undergrad    -0.22      -0.07  0.11     0.31  0.32     0.30      0.28
## P.Undergrad    -0.25      -0.06  0.08     0.32  0.15     0.14      0.23
## Outstate        1.00       0.65  0.04    -0.30  0.39     0.41     -0.55
## Room.Board      0.65       1.00  0.13    -0.20  0.34     0.38     -0.36
## Books           0.04       0.13  1.00     0.18  0.03     0.10     -0.03
## Personal       -0.30      -0.20  0.18     1.00 -0.01    -0.03      0.14
## PhD             0.39       0.34  0.03    -0.01  1.00     0.85     -0.14
## Terminal        0.41       0.38  0.10    -0.03  0.85     1.00     -0.16
## S.F.Ratio      -0.55      -0.36 -0.03     0.14 -0.14    -0.16      1.00
## perc.alumni     0.57       0.27 -0.04    -0.29  0.25     0.27     -0.40
## Expend          0.67       0.50  0.11    -0.10  0.44     0.44     -0.58
## Grad.Rate       0.58       0.42  0.00    -0.27  0.32     0.30     -0.31
##             perc.alumni Expend Grad.Rate
## Apps              -0.09   0.28      0.15
## Accept            -0.17   0.13      0.06
## Enroll            -0.18   0.06     -0.03
## Top10perc          0.46   0.66      0.50
## Top25perc          0.42   0.53      0.48
## F.Undergrad       -0.23   0.02     -0.08
## P.Undergrad       -0.28  -0.08     -0.26
## Outstate           0.57   0.67      0.58
## Room.Board         0.27   0.50      0.42
## Books             -0.04   0.11      0.00
## Personal          -0.29  -0.10     -0.27
## PhD                0.25   0.44      0.32
## Terminal           0.27   0.44      0.30
## S.F.Ratio         -0.40  -0.58     -0.31
## perc.alumni        1.00   0.42      0.49
## Expend             0.42   1.00      0.39
## Grad.Rate          0.49   0.39      1.00

Note: The variable Apps has the most correlation with the variables Accept, Enroll and F.Undergrad. And has the least correlation with Outstate and perc.alumni variables. And High correlations can be seen among the independent variables themselves.

par(mfrow = c(3, 3), mar = c(2, 2, 2, 2))        
 for (i in 3:11) {
      plot(college_data[, i], college_data$Apps, xlab = "", main = paste("Apps vs", names(college_data)[i]))
 }

for (i in 12:18) {
      plot(college_data[, i], college_data$Apps, xlab = "", main = paste("Apps vs", names(college_data)[i]))
}
par(mfrow = c(1, 1))

Note: The Apps variable is linearly related to some variables and non-linearly related to some of them.
Most variables do not follow the normal function, which is not a concern because in regression the variables do not necessarily have to be normal.

Questions of Data understanding

  1. Where is the data obtained and how is it collected?
  2. What do each of the variables measure?
  3. Is there ambiguity in the definitions of the data?
  4. Is there an error in measuring variables or recording data?
  5. What other variables, if any, could help solve the problem?
  6. What kind of variables are there (categorical-numerical)?
  7. What is the statistical summary of available variables?

Answer: The above questions are answered during section A. more details:
Variable A was a bit vague and I did not notice it accurately.
Other variables that could help solve the problem:

  1. Number of scientific articles published by each university.
  2. Number of research-practical works registered under the supervision of the university
  3. Number of scientific conferences held by the university

3.Data Preparation & 4.Modeling

These two steps overlap, so let’s check together.

Questions of Data preparation

  1. Is there a need to integrate data?
  2. Is there a need to clear the data?
  3. Need to convert data?
  4. Is there a need to reduce data?

Answer: A number of very outdated data were deleted, the data was converted and reported.

Dealing with missing valuees

sum(is.na(college_data))
## [1] 0

Note: Fortunately, our dataset does not contain missing values.

Start Modeling

We divide the data set into two parts, train and test. And we build our model on train and check it on the test.

#Divide Dataset into Train and Test
set.seed(12)
train_cases <- sample(1:nrow(college_data), nrow(college_data) * 0.7)
train <- college_data[train_cases, ]
test <- college_data[-train_cases, ]
dim(train)
## [1] 541  18
dim(test)
## [1] 233  18
summary(train)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :148   Min.   :  100   Min.   :   90   Min.   :  35   Min.   : 1.00  
##  Yes:393   1st Qu.:  776   1st Qu.:  595   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1616   Median : 1159   Median : 438   Median :24.00  
##            Mean   : 3099   Mean   : 2075   Mean   : 810   Mean   :28.25  
##            3rd Qu.: 3877   3rd Qu.: 2603   3rd Qu.: 944   3rd Qu.:36.00  
##            Max.   :20192   Max.   :15096   Max.   :6392   Max.   :95.00  
##    Top25perc       F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.00   Min.   :  282   Min.   :    1.0   Min.   : 2580  
##  1st Qu.: 40.00   1st Qu.: 1022   1st Qu.:   95.0   1st Qu.: 7230  
##  Median : 54.00   Median : 1757   Median :  353.0   Median : 9950  
##  Mean   : 56.21   Mean   : 3830   Mean   :  850.2   Mean   :10423  
##  3rd Qu.: 70.00   3rd Qu.: 4623   3rd Qu.:  982.0   3rd Qu.:12950  
##  Max.   :100.00   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1880   Min.   : 110.0   Min.   : 300   Min.   :  8.00  
##  1st Qu.:3579   1st Qu.: 468.0   1st Qu.: 900   1st Qu.: 62.00  
##  Median :4180   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4359   Mean   : 553.2   Mean   :1351   Mean   : 72.35  
##  3rd Qu.:5062   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :4913   Max.   :100.00  
##     Terminal        S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.00   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 70.00   1st Qu.:11.50   1st Qu.:14.00   1st Qu.: 6786  
##  Median : 81.00   Median :13.70   Median :21.00   Median : 8328  
##  Mean   : 79.42   Mean   :14.15   Mean   :22.97   Mean   : 9617  
##  3rd Qu.: 92.00   3rd Qu.:16.60   3rd Qu.:31.00   3rd Qu.:10864  
##  Max.   :100.00   Max.   :39.80   Max.   :64.00   Max.   :45702  
##    Grad.Rate    
##  Min.   : 10.0  
##  1st Qu.: 54.0  
##  Median : 65.0  
##  Mean   : 65.9  
##  3rd Qu.: 78.0  
##  Max.   :100.0
summary(test)
##  Private        Apps           Accept          Enroll         Top10perc    
##  No : 62   Min.   :   81   Min.   :   72   Min.   :  46.0   Min.   : 1.00  
##  Yes:171   1st Qu.:  785   1st Qu.:  632   1st Qu.: 242.0   1st Qu.:15.00  
##            Median : 1444   Median : 1086   Median : 417.0   Median :22.00  
##            Mean   : 2588   Mean   : 1785   Mean   : 697.5   Mean   :26.02  
##            3rd Qu.: 3073   3rd Qu.: 2042   3rd Qu.: 776.0   3rd Qu.:31.00  
##            Max.   :21804   Max.   :18744   Max.   :5874.0   Max.   :96.00  
##    Top25perc       F.Undergrad     P.Undergrad         Outstate    
##  Min.   : 13.00   Min.   :  139   Min.   :    3.0   Min.   : 2340  
##  1st Qu.: 43.00   1st Qu.:  943   1st Qu.:   95.0   1st Qu.: 7560  
##  Median : 54.00   Median : 1656   Median :  355.0   Median :10217  
##  Mean   : 54.87   Mean   : 3345   Mean   :  861.5   Mean   :10523  
##  3rd Qu.: 66.00   3rd Qu.: 3499   3rd Qu.:  871.0   3rd Qu.:12850  
##  Max.   :100.00   Max.   :30017   Max.   :10962.0   Max.   :20100  
##    Room.Board       Books           Personal         PhD       
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   : 10.0  
##  1st Qu.:3620   1st Qu.: 470.0   1st Qu.: 800   1st Qu.: 62.0  
##  Median :4270   Median : 500.0   Median :1200   Median : 76.0  
##  Mean   :4357   Mean   : 539.5   Mean   :1321   Mean   : 73.4  
##  3rd Qu.:5045   3rd Qu.: 600.0   3rd Qu.:1675   3rd Qu.: 87.0  
##  Max.   :6950   Max.   :1400.0   Max.   :6800   Max.   :100.0  
##     Terminal        S.F.Ratio      perc.alumni        Expend     
##  Min.   : 33.00   Min.   : 3.30   Min.   : 0.00   Min.   : 3480  
##  1st Qu.: 72.00   1st Qu.:11.40   1st Qu.:11.00   1st Qu.: 6744  
##  Median : 85.00   Median :13.50   Median :20.00   Median : 8539  
##  Mean   : 80.39   Mean   :13.92   Mean   :22.27   Mean   : 9778  
##  3rd Qu.: 91.00   3rd Qu.:15.90   3rd Qu.:31.00   3rd Qu.:10779  
##  Max.   :100.00   Max.   :27.80   Max.   :60.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 15.00  
##  1st Qu.: 52.00  
##  Median : 65.00  
##  Mean   : 64.26  
##  3rd Qu.: 77.00  
##  Max.   :100.00

Note: train and test data distribution is close together which is good.

Model 1: Tranditional liner Regression

Bulding prediction model
m1 <- lm(Apps ~ . , data = train)
summary(m1)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3854.9  -437.0   -83.7   287.1  6268.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -681.06298  494.40443  -1.378 0.168934    
## PrivateYes  -523.96480  162.09226  -3.233 0.001304 ** 
## Accept         1.29079    0.05975  21.603  < 2e-16 ***
## Enroll        -0.22694    0.20714  -1.096 0.273744    
## Top10perc     46.83186    6.74940   6.939 1.17e-11 ***
## Top25perc    -15.04953    5.32878  -2.824 0.004921 ** 
## F.Undergrad    0.05390    0.03601   1.497 0.135030    
## P.Undergrad    0.05653    0.03577   1.580 0.114609    
## Outstate      -0.06529    0.02302  -2.836 0.004742 ** 
## Room.Board     0.13355    0.05485   2.435 0.015237 *  
## Books         -0.15121    0.25937  -0.583 0.560143    
## Personal      -0.01000    0.07752  -0.129 0.897354    
## PhD          -10.65127    5.46434  -1.949 0.051802 .  
## Terminal      -1.93004    5.86554  -0.329 0.742252    
## S.F.Ratio     17.91148   15.11256   1.185 0.236474    
## perc.alumni   -4.07861    4.77966  -0.853 0.393869    
## Expend         0.10921    0.01680   6.500 1.88e-10 ***
## Grad.Rate     12.81339    3.62074   3.539 0.000438 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 999 on 523 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.9239 
## F-statistic: 386.5 on 17 and 523 DF,  p-value: < 2.2e-16

Note: Model summary results

  1. p_value < 0.05 : The result of statistical test F shows that there is a linear relationship between the dependent variable and the independent variables.
  2. The results show that we cannot keep variables that do not have * or ** or *** inside the model. And how reliable the results of the statistical test T are is that there is no problem of Multicollinearity between the variables.
  3. R-squared and Adjusted R-squared show how well the model shows the relationship between the variables. The closer it is to 1, the better.
  4. Residuals: Represents a summary of the status of errors.

Conclusion: Variables that do not perform well in the statistical test T are removed from the model.

Bulding prediction new model
m1_1 <- lm(Apps ~ Private + Accept + Top10perc +  Outstate + Room.Board + Expend + Grad.Rate , data = train)
summary(m1_1)
## 
## Call:
## lm(formula = Apps ~ Private + Accept + Top10perc + Outstate + 
##     Room.Board + Expend + Grad.Rate, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3929.4  -472.8   -48.6   304.7  6415.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.283e+03  2.185e+02  -5.874 7.50e-09 ***
## PrivateYes  -4.630e+02  1.478e+02  -3.132  0.00183 ** 
## Accept       1.303e+00  2.374e-02  54.878  < 2e-16 ***
## Top10perc    2.705e+01  3.700e+00   7.311 9.75e-13 ***
## Outstate    -9.101e-02  2.159e-02  -4.216 2.92e-05 ***
## Room.Board   1.325e-01  5.290e-02   2.504  0.01256 *  
## Expend       1.072e-01  1.493e-02   7.176 2.42e-12 ***
## Grad.Rate    8.989e+00  3.417e+00   2.631  0.00876 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1016 on 533 degrees of freedom
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9213 
## F-statistic: 903.9 on 7 and 533 DF,  p-value: < 2.2e-16

Note: The condition of the model is good based on the previous statements, but the assumptions of the regression model should also be considered.
We can generalize the above results to the community when the following hypotheses are met.
Regression has two main assumptions:

  1. Residuals follow the normal distribution with a mean of 0 and standard deviation sigma ^ 2.
  2. Residuals are independent of each other.
#Check Assumption of Regression
##Normality of residuals
###Histogram
hist(m1_1$residuals, probability = T)
lines(density(m1$residuals), col = "red")

###QQplot
qqnorm(m1_1$residuals , main = "QQ_plot of residuals", pch = 20)
qqline(m1_1$residuals, col = "red")

###Test Skewness and Kurtosis
moments :: jarque.test(m1_1$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  m1_1$residuals
## JB = 3442.5, p-value < 2.2e-16
## alternative hypothesis: greater
          anscombe.test(m1_1$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  m1_1$residuals
## kurt = 14.538, z = 10.741, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3

Conclusion: Residuals are not normally distributed. So the first assumption was rejected.

#Diagnostic plots
plot(m1_1)

Note: Diagnostic plot show:

  1. first plot: Shows how our Residuals are distributed based on Y^. And helps to identify heteroscedasticity of variance.
  2. second plot: QQ plot Which is used to check for normality.
  3. third plot: Shows the relationship between the square root of the absolute magnitude of Residuals and Y^. Which should be smooth.
  4. fourth plot: Perfect diagram for Outlier detection. If there is a problem, an error occurs called Cook’s distance, which is higher than 1, ie our model has a problem.

conclusion: There is a problem of heteroscedasticity.

#Check Multicollinearity
car :: vif(m1)
##     Private      Accept      Enroll   Top10perc   Top25perc F.Undergrad 
##    2.830423   10.642178   21.103948    7.983988    6.282830   16.880319 
## P.Undergrad    Outstate  Room.Board       Books    Personal         PhD 
##    1.658309    4.671284    2.082366    1.098323    1.336526    4.344837 
##    Terminal   S.F.Ratio perc.alumni      Expend   Grad.Rate 
##    4.207430    2.005022    1.868395    3.728812    2.003254
car :: vif(m1_1)
##    Private     Accept  Top10perc   Outstate Room.Board     Expend  Grad.Rate 
##   2.276805   1.624296   2.320346   3.972856   1.873046   2.849304   1.725085

Note: If the numbers are less than 10, there is no problem of multicolinearity.
conclusion: The model does not have the problem of multicolinearity and the results of the statistical test T can be trusted.

This model has violated the assumptions of the regression model, so it seems to be a bad model.

Test the model
#prediction
pred_m1_1 <- predict(m1_1, test)
#Absolute error
abs_err_m1_1 <- abs(pred_m1_1 - test$Apps)
summary(abs_err_m1_1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.099  199.481  390.507  596.973  709.045 6972.844
sd(abs_err_m1_1)
## [1] 773.7654
#Histogram and Boxplot
hist(abs_err_m1_1)

boxplot(abs_err_m1_1)

Comparisions of models
 df_err <- data.frame("m1_1" = abs_err_m1_1 ) 
  models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##      mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1          596.9731            390.5073        773.7654         2.098694
##      max.of.Abs.Error
## m1_1         6972.844

Note: We create a data frame to compare models.

Model 2: Box_Cox Transformation

When the data is skewed, we transformation the variable so that the data distribution follows the normal.

  1. transformed_x = (x ^ lambda -1) / lambda if lambda != 0
  2. transformed_x = log(x) if lambda = 0
#find best lambda for boxcox transformation
library("MASS")
box_results <- boxcox(Apps ~ ., data = train, lambda = seq(-5, 5, 0.1))

box_results <- data.frame(box_results$x, box_results$y)
lambda <- box_results[which(box_results$box_results.y == max(box_results$box_results.y)), 1]
lambda
## [1] 0.5
#create boxcox_Apps variable
train$boxcox_Apps <- (train$Apps ^ 0.5 - 1) / 0.5
#check normality
##Histogram
hist(train$boxcox_Apps, probability = TRUE, breaks = 15)
lines(density(train$boxcox_Apps), col = "red")

##QQplot
qqnorm(train$boxcox_Apps, main = "QQ_plot", pch = 20)
qqline(train$boxcox_Apps, col = "red")

##Test for skewness and kurtosis
jarque.test(train$boxcox_Apps)
## 
##  Jarque-Bera Normality Test
## 
## data:  train$boxcox_Apps
## JB = 125.36, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(train$boxcox_Apps)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  train$boxcox_Apps
## kurt = 3.6693, z = 2.6295, p-value = 0.008551
## alternative hypothesis: kurtosis is not equal to 3

Conclusion: There was a lot of skewness and we could not convert it to Normal Distribution by Boxcox Transformation.

Bulding prediction model
m2 <- lm(boxcox_Apps ~ . - Apps, data = train)
summary(m2)
## 
## Call:
## lm(formula = boxcox_Apps ~ . - Apps, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.741  -9.757  -0.242   8.710  64.487 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.313e+01  8.324e+00  -1.577  0.11536    
## PrivateYes  -1.668e+01  2.729e+00  -6.111 1.93e-09 ***
## Accept       1.779e-02  1.006e-03  17.679  < 2e-16 ***
## Enroll       7.607e-04  3.488e-03   0.218  0.82741    
## Top10perc    4.770e-01  1.136e-01   4.198 3.17e-05 ***
## Top25perc   -8.633e-02  8.972e-02  -0.962  0.33640    
## F.Undergrad -6.522e-04  6.063e-04  -1.076  0.28252    
## P.Undergrad  1.728e-03  6.022e-04   2.869  0.00428 ** 
## Outstate    -2.256e-04  3.876e-04  -0.582  0.56078    
## Room.Board   2.515e-03  9.236e-04   2.723  0.00669 ** 
## Books        6.401e-03  4.367e-03   1.466  0.14331    
## Personal     1.290e-03  1.305e-03   0.988  0.32357    
## PhD          4.105e-02  9.200e-02   0.446  0.65567    
## Terminal     1.073e-01  9.876e-02   1.087  0.27763    
## S.F.Ratio    1.218e+00  2.545e-01   4.787 2.21e-06 ***
## perc.alumni -1.147e-01  8.048e-02  -1.426  0.15456    
## Expend       1.760e-03  2.829e-04   6.222 1.01e-09 ***
## Grad.Rate    2.750e-01  6.096e-02   4.511 7.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.82 on 523 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9061 
## F-statistic: 307.5 on 17 and 523 DF,  p-value: < 2.2e-16

Conclusion: Variables that do not perform well in the statistical test T are removed from the model.

m2_1 <- lm(boxcox_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board + S.F.Ratio + Expend + Grad.Rate , data = train)
summary(m2_1)
## 
## Call:
## lm(formula = boxcox_Apps ~ Private + Accept + Top10perc + P.Undergrad + 
##     Room.Board + S.F.Ratio + Expend + Grad.Rate, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.146  -9.852  -0.214   8.722  66.601 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.012e+00  6.259e+00  -0.481 0.630533    
## PrivateYes  -1.847e+01  2.336e+00  -7.908 1.52e-14 ***
## Accept       1.719e-02  4.135e-04  41.576  < 2e-16 ***
## Top10perc    3.956e-01  6.113e-02   6.473 2.19e-10 ***
## P.Undergrad  1.658e-03  5.612e-04   2.955 0.003262 ** 
## Room.Board   2.924e-03  8.247e-04   3.545 0.000426 ***
## S.F.Ratio    1.240e+00  2.518e-01   4.924 1.14e-06 ***
## Expend       1.840e-03  2.564e-04   7.176 2.42e-12 ***
## Grad.Rate    2.291e-01  5.570e-02   4.113 4.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.87 on 532 degrees of freedom
## Multiple R-squared:  0.907,  Adjusted R-squared:  0.9056 
## F-statistic: 648.3 on 8 and 532 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression
plot(m2_1)

Conclusion: Residuals are not normally distributed. So the first assumption was rejected. and There is a problem of heteroscedasticity.

#Check Multicollinearity
car :: vif(m2_1)
##     Private      Accept   Top10perc P.Undergrad  Room.Board   S.F.Ratio 
##    2.061826    1.787393    2.296774    1.431806    1.650965    1.952909 
##      Expend   Grad.Rate 
##    3.045916    1.662579

conclusion: The model does not have the problem of multicolinearity and the results of the statistical test can be trusted.

Test the model
#prediction
pred_m2_1 <- predict(m2_1, test)
pred_m2_1 <- (pred_m2_1 + 2 / 2) ^ 2 
#Absolute error
abs_err_m2_1 <- abs(pred_m2_1 - test$Apps)
summary(abs_err_m2_1)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     44.96   2635.42   4533.46   7667.82   8234.31 141274.20
sd(abs_err_m2_1)
## [1] 12179.97
#Histogram and Boxplot
hist(abs_err_m2_1)

boxplot(abs_err_m2_1)

Conclusion:Box_Cox Transformation makes the Residuals situation worse. And this model is bad.

Comparisions of models
 df_err <- data.frame("m1_1" = abs_err_m1_1,
                      "m2_1" = abs_err_m2_1) 

  models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##      mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1          596.9731            390.5073        773.7654         2.098694
## m2_1         7667.8177           4533.4558      12179.9685        44.958069
##      max.of.Abs.Error
## m1_1         6972.844
## m2_1       141274.196

conclusion: By far, Model 1 is better than Model 2.

head(train)
##     Private  Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 451     Yes   692    576    174        19        50         597          83
## 347     Yes   499    441    199        26        52         846         377
## 337     Yes   874    758    428        21        46        1605         246
## 248     Yes   467    424    350        16        40        1365         334
## 175     Yes 13789   3893   1583        90        98        6188          53
## 454     Yes  1133    630    220        37        73         750          30
##     Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 451    10500       3860   600      940  58       64      11.6          19
## 347    11200       7400   600     1300  66       79       6.8          50
## 337     9858       3700   450     1200  42       45      17.6          16
## 248     6300       2980   700     2140  75       79      13.7          10
## 175    18590       5950   625     1162  95       96       5.0          44
## 454    17688       5900   650      850 100      100      10.4          11
##     Expend Grad.Rate boxcox_Apps
## 451   8990        39    50.61179
## 347  10819        90    42.67662
## 337   4796        55    57.12698
## 248   7054        38    41.22037
## 175  27206        97   232.85315
## 454  14820        73    65.32013
train <- train[,-19]

Model 3: using k_fold cross_validation approach

Display train data evenly in k folders. and We have 17 models and we create a matrix that gets the model error based on each folder.

library("leaps")
k <- 10
set.seed(123)
folds <- sample(1:k, nrow(train), rep = TRUE)
cv_errors <- matrix(NA, k, 17, dimnames = list(NULL, paste(1:17)))
cv_errors
##        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
##  [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [7,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [8,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [9,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Bulding prediction model
predict_regsubsets <- function(object, newdata, id){
   reg_formula <- as.formula(object$call[[2]])
   mat <- model.matrix(reg_formula, newdata)
   coef_i <- coef(object, id = id)
   mat[, names(coef_i)] %*% coef_i
}

for (i in 1:k) {
   best_fit <- regsubsets(Apps ~ ., data = train[folds != i, ], nvmax = 17)
   for ( j in 1:17) {
      pred <- predict_regsubsets(best_fit, newdata = train[folds == i,], id = j)
      cv_errors[i, j] <- mean((train$Apps[folds == i] - pred) ^ 2)
   }
}
cv_errors
##               1         2         3         4         5         6         7
##  [1,] 1147040.2  734013.3  622946.9  664345.8  682101.5  656489.3  723483.1
##  [2,]  711738.7  729031.0  668876.8  645100.8  583021.2  570368.5  611519.2
##  [3,] 1726947.5 1189231.2 1265115.0 1384921.8 1353493.5 1410367.1 1405397.3
##  [4,]  976642.2  944835.8  929006.1  890367.3  935175.4  992716.3  945305.6
##  [5,]  491653.6  605304.1  592482.7  556010.3  560040.3  547570.7  539480.4
##  [6,] 3078756.3 1821222.4 1701720.1 1617094.8 1535879.3 1544437.9 1485121.0
##  [7,] 1491829.2  941344.3  944381.4  663933.4  702487.8  679105.3  663658.7
##  [8,] 2375703.4 1653448.9 1700841.8 1646361.3 1682634.6 1682288.6 1625430.3
##  [9,] 2822617.7 1834083.7 1656356.6 1630456.6 1640019.6 1632766.4 1620522.9
## [10,] 1484201.6 1256031.3 1282608.3 1218994.2 1225445.2 1204905.1 1189939.5
##               8         9        10        11        12        13        14
##  [1,]  681032.1  673576.3  645447.8  657104.1  668861.8  668284.0  661776.0
##  [2,]  578855.8  562382.3  553115.2  548426.2  557424.8  552990.5  561225.4
##  [3,] 1390953.1 1335829.7 1313677.2 1341407.8 1322889.0 1309580.7 1312246.5
##  [4,]  983446.9  919248.6  961848.3  964068.8  946920.9  942806.7 1037027.8
##  [5,]  514616.3  469771.1  498348.6  509011.6  524459.0  511355.5  518352.7
##  [6,] 1445470.8 1398775.5 1447689.9 1502249.5 1487995.1 1493032.8 1485943.9
##  [7,]  659452.2  586361.8  587831.8  588993.1  586853.2  575549.6  573131.6
##  [8,] 1597411.4 1513743.1 1543913.4 1592400.4 1596843.6 1635225.7 1630923.6
##  [9,] 1613383.9 1552481.4 1523411.7 1558764.7 1542918.2 1530932.2 1540417.4
## [10,] 1218153.7 1194336.2 1143673.1 1157847.6 1158205.7 1155268.8 1163186.7
##              15        16        17
##  [1,]  650638.6  655356.8  655963.9
##  [2,]  556175.6  557230.3  556015.5
##  [3,] 1321141.9 1327150.8 1327556.4
##  [4,] 1032288.5 1036177.6 1035327.7
##  [5,]  492829.8  491480.0  491881.2
##  [6,] 1482149.3 1484401.1 1486444.4
##  [7,]  578544.4  578259.8  575516.3
##  [8,] 1627024.7 1632870.7 1631921.2
##  [9,] 1535589.5 1534823.3 1539092.6
## [10,] 1160835.7 1164056.5 1167395.5

Note:It was modeled 10 times for each and its errors were reported.

mean_cv_errors <- apply(cv_errors, 2, mean)
mean_cv_errors
##       1       2       3       4       5       6       7       8       9      10 
## 1630713 1170855 1136434 1091759 1090030 1092102 1080986 1068278 1020651 1021896 
##      11      12      13      14      15      16      17 
## 1042027 1039337 1037503 1048423 1043722 1046181 1046711
plot(mean_cv_errors, type = "b")

which.min(mean_cv_errors)
## 9 
## 9

Note: We get the average of each column to find out which model is the best. And A model with 9 variables is the best.

m3 <- regsubsets(Apps ~ ., data = train, nvmax = 17)
coef(m3,9)
##   (Intercept)    PrivateYes        Accept     Top10perc     Top25perc 
## -386.77781410 -630.44356361    1.31896484   45.36497501  -14.46736061 
##      Outstate    Room.Board           PhD        Expend     Grad.Rate 
##   -0.07683097    0.15147886  -10.84808271    0.10042862   10.59897483

conclusion: We build a new model with these variables

#Model making
m3_1 <- lm(Apps ~ Private + Accept + Top10perc + Top25perc + Outstate + Room.Board + PhD + Expend +Grad.Rate , data = train)
summary(m3_1)
## 
## Call:
## lm(formula = Apps ~ Private + Accept + Top10perc + Top25perc + 
##     Outstate + Room.Board + PhD + Expend + Grad.Rate, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3753.8  -436.5   -58.5   293.0  6119.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -386.77781  298.75728  -1.295 0.196013    
## PrivateYes  -630.44356  153.44125  -4.109 4.61e-05 ***
## Accept         1.31896    0.02367  55.733  < 2e-16 ***
## Top10perc     45.36498    6.60507   6.868 1.82e-11 ***
## Top25perc    -14.46736    5.21210  -2.776 0.005702 ** 
## Outstate      -0.07683    0.02151  -3.572 0.000386 ***
## Room.Board     0.15148    0.05235   2.894 0.003963 ** 
## PhD          -10.84808    3.58658  -3.025 0.002610 ** 
## Expend         0.10043    0.01523   6.595 1.03e-10 ***
## Grad.Rate     10.59897    3.38373   3.132 0.001830 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1000 on 531 degrees of freedom
## Multiple R-squared:  0.925,  Adjusted R-squared:  0.9237 
## F-statistic: 727.6 on 9 and 531 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression
plot(m3_1)

Note: There is a problem of heteroscedasticity.

Test the model
#Prediction
pred_m3_1 <- predict(m3_1, test)
#Absolute error
abs_err_m3_1 <- abs(pred_m3_1 - test$Apps)
summary(abs_err_m3_1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.237  171.543  398.241  596.230  692.421 6604.283
sd(abs_err_m3_1)
## [1] 755.0726
#Histogram and Boxplot
hist(abs_err_m3_1)

boxplot(abs_err_m3_1)

Comparisions of models
 df_err <- data.frame("m1_1" = abs_err_m1_1,
                      "m2_1" = abs_err_m2_1,
                      "m3_1" = abs_err_m3_1) 

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##      mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1          596.9731            390.5073        773.7654        2.0986942
## m2_1         7667.8177           4533.4558      12179.9685       44.9580688
## m3_1          596.2298            398.2408        755.0726        0.2370317
##      max.of.Abs.Error
## m1_1         6972.844
## m2_1       141274.196
## m3_1         6604.283
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

conclusion: Model 1 and Model 3 are similar. And Model 2 is inefficient.

Model 4: Ridge Regression

Regularization
Ridge Regression:
The goal is the optimize:
RSS + lambda * sum(beta_i ^ 2)
lambda >= 0, a tuning parameter

Choose the best lambda

Note: for Ridge Regression must making model.matrix()

x <- model.matrix(Apps ~ . , data = train)[, -1] #Remove intercepts
y <- train$Apps
#Apply ridge regression
library("glmnet")
## Loading required package: Matrix
## Loaded glmnet 4.0-2
m4_ridge <- glmnet(x, y, alpha = 0)
dim(coef(m4_ridge))
## [1]  18 100

Note: R creates lambda. We have 100 lambda
If alpha = 0 then create Ridge Regression model.
If alpha = 1 then create Lasso model.

plot(m4_ridge, xvar = "lambda")

Note:The larger the lambda, the closer the regression coefficients are to 0 but never to 0.

ridge_cv <- cv.glmnet(x, y, alpha = 0)
best_lambda <- ridge_cv$lambda.min
best_lambda
## [1] 338.1328

Conclusion: the best lambda is 338.13 .

Bulding prediction model
ridge_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(ridge_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -1.616039e+03
## PrivateYes  -5.476519e+02
## Accept       7.684774e-01
## Enroll       6.658907e-01
## Top10perc    2.558789e+01
## Top25perc   -1.338570e+00
## F.Undergrad  1.075362e-01
## P.Undergrad  2.563345e-02
## Outstate    -1.213134e-03
## Room.Board   1.848040e-01
## Books       -1.586445e-02
## Personal    -2.188587e-02
## PhD         -3.418311e+00
## Terminal    -3.757458e+00
## S.F.Ratio    1.245465e+01
## perc.alumni -1.238471e+01
## Expend       9.836593e-02
## Grad.Rate    1.292104e+01

Note: Regression coefficients were obtained in proportion to the best lambda.

Test the model
x_test <- model.matrix(Apps ~ . , data = test)[, -1]
pred_m4_ridge <- predict(m4_ridge, s = best_lambda, newx = x_test)
pred_m4_ridge <- c(pred_m4_ridge)
abs_err_m4_ridge <- abs(pred_m4_ridge - test$Apps)
summary(abs_err_m4_ridge)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    5.108  140.511  405.389  637.808  822.121 6530.076
sd(abs_err_m4_ridge)
## [1] 814.0771
#Histogram and Boxplot
hist(abs_err_m4_ridge) 

boxplot(abs_err_m4_ridge)

Comparisions of models
 df_err <- data.frame("m1_1"     = abs_err_m1_1,
                      "m3_1"     = abs_err_m3_1,
                      "m4_ridge" = abs_err_m4_ridge)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##          mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1              596.9731            390.5073        773.7654        2.0986942
## m3_1              596.2298            398.2408        755.0726        0.2370317
## m4_ridge          637.8083            405.3886        814.0771        5.1078962
##          max.of.Abs.Error
## m1_1             6972.844
## m3_1             6604.283
## m4_ridge         6530.076

Conclusion: The new model is not better than the previous two models.

#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

conclusion: m1_1 and m3_1 models are the best.

Model 5: Lasso Regression

Regularization
Lasso Regression:
The goal is the optimize:
RSS + lambda * sum(abs(beta_i))
lambda >= 0, a tuning parameter

Note: Lasso Regression is used when not all variables are good and we want to select the best variables to build the model.

Choose the best lambda
#Apply lasso regression
m5_lasso <- glmnet(x, y, alpha = 1)
dim(coef(m5_lasso))
## [1] 18 81

Note: R creates lambda. We have 81 lambda.

plot(m5_lasso, xvar = "lambda")

lasso_cv <- cv.glmnet(x, y, alpha = 1)
best_lambda <- lasso_cv$lambda.min
best_lambda
## [1] 8.774628

Conclusion: the best lambda is 8.77 .

Bulding prediction model
lasso_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(lasso_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -809.48674882
## PrivateYes  -515.76727715
## Accept         1.25061635
## Enroll         .         
## Top10perc     39.83066075
## Top25perc     -9.76835696
## F.Undergrad    0.02828286
## P.Undergrad    0.04658106
## Outstate      -0.05191041
## Room.Board     0.11921218
## Books         -0.06515242
## Personal       .         
## PhD           -8.33462968
## Terminal      -2.40392165
## S.F.Ratio     12.91780479
## perc.alumni   -4.51089751
## Expend         0.10401740
## Grad.Rate     11.17444304

Note: Variables Enroll and Personal are not used in model construction.

Test the model
pred_m5_lasso <- predict(m5_lasso, s = best_lambda, newx = x_test)
pred_m5_lasso <- c(pred_m5_lasso)
abs_err_m5_lasso <- abs(pred_m5_lasso - test$Apps)
summary(abs_err_m5_lasso)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.78  176.74  354.43  582.60  691.62 6669.74
sd(abs_err_m5_lasso)
## [1] 757.9307
#Histogram and Boxplot
hist(abs_err_m5_lasso) 

boxplot(abs_err_m5_lasso)

Comparisions of models
 df_err <- data.frame("m1_1"     = abs_err_m1_1,
                      "m3_1"     = abs_err_m3_1,
                      "m4_ridge" = abs_err_m4_ridge,
                      "m5_lasso" = abs_err_m5_lasso)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##          mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1              596.9731            390.5073        773.7654        2.0986942
## m3_1              596.2298            398.2408        755.0726        0.2370317
## m4_ridge          637.8083            405.3886        814.0771        5.1078962
## m5_lasso          582.5965            354.4272        757.9307        2.7801727
##          max.of.Abs.Error
## m1_1             6972.844
## m3_1             6604.283
## m4_ridge         6530.076
## m5_lasso         6669.743

Note: Lasso model works a little better.

#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

BootStraping: Bagging and Random Forest

  • Bagging : Uses all variables to construct a prediction model.
  • Random Forest : Uses a random subset of variables to construct a prediction model.

Model 6: Bagging

Bulding prediction model
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
m6_bagging <- randomForest(Apps ~ ., mtry = ncol(train) - 1, data = train)
Test the model
pred_m6_bagging <- predict(m6_bagging,newdata = test)
abs_err_6_bagging <- abs(pred_m6_bagging - test$Apps)
summary(abs_err_6_bagging)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.419   56.930  168.245  446.009  546.522 6774.290
sd(abs_err_6_bagging)
## [1] 796.3811
Comparisions of models
 df_err <- data.frame("m1_1"     = abs_err_m1_1,
                      "m3_1"     = abs_err_m3_1,
                      "m4_ridge" = abs_err_m4_ridge,
                      "m5_lasso" = abs_err_m5_lasso,
                      "m6_bagging" = abs_err_6_bagging)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##            mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1                596.9731            390.5073        773.7654
## m3_1                596.2298            398.2408        755.0726
## m4_ridge            637.8083            405.3886        814.0771
## m5_lasso            582.5965            354.4272        757.9307
## m6_bagging          446.0091            168.2453        796.3811
##            min.of.Abs.Error max.of.Abs.Error
## m1_1              2.0986942         6972.844
## m3_1              0.2370317         6604.283
## m4_ridge          5.1078962         6530.076
## m5_lasso          2.7801727         6669.743
## m6_bagging        0.4188667         6774.290

Note: Bagging works better than other models.

#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

Model 7: RandomForest

Bulding prediction model
#create random forest model 
m7_rf <- randomForest(Apps ~ ., data = train, importance = TRUE)
m7_rf
## 
## Call:
##  randomForest(formula = Apps ~ ., data = train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 939736.9
##                     % Var explained: 92.82

Note: Our forest is made up of 500 trees, which explains 92% of the variance.

#model interpretation
importance(m7_rf)
##               %IncMSE IncNodePurity
## Private      6.238110     156795273
## Accept      27.821061    2417952619
## Enroll      19.193297    1782445001
## Top10perc    9.676049     209983208
## Top25perc   12.249743     193720006
## F.Undergrad 15.647214    1196008946
## P.Undergrad  5.412150     147035186
## Outstate    13.462204     138037408
## Room.Board  11.312324      84674014
## Books        1.912405      28965936
## Personal     3.359643      40517105
## PhD          7.210371     177730891
## Terminal     7.041672     139776678
## S.F.Ratio    8.124410      52224570
## perc.alumni  2.176190      30647289
## Expend       9.664250     174237461
## Grad.Rate    9.677273     107670091
varImpPlot(m7_rf)

Note:model interpretation

  • %IncMSE = If we remove each variable from the model, we expect how low the average accuracy of the model will be.
  • IncNodePurity = Shows how much RSS increases with the removal of each of the variables.

By removing Except, we have the greatest reduction in model accuracy.

Test the model
pred_m7_rf <- predict(m7_rf, newdata = test)
abs_err_m7_rf <- abs(pred_m7_rf - test$Apps)
summary(abs_err_m7_rf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.1   101.3   222.2   512.3   606.3  6733.1
sd(abs_err_m7_rf)
## [1] 854.9199
Comparisions of models
##### Comparisions of models

 df_err <- data.frame("m1_1"    = abs_err_m1_1,
                       "m3_1"     = abs_err_m3_1,
                       "m4_ridge" = abs_err_m4_ridge,
                       "m5_lasso" = abs_err_m5_lasso,
                       "m6_bag"   = abs_err_6_bagging,
                       "m7_rf"    = abs_err_m7_rf)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##          mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1              596.9731            390.5073        773.7654        2.0986942
## m3_1              596.2298            398.2408        755.0726        0.2370317
## m4_ridge          637.8083            405.3886        814.0771        5.1078962
## m5_lasso          582.5965            354.4272        757.9307        2.7801727
## m6_bag            446.0091            168.2453        796.3811        0.4188667
## m7_rf             512.2881            222.2484        854.9199        2.1003333
##          max.of.Abs.Error
## m1_1             6972.844
## m3_1             6604.283
## m4_ridge         6530.076
## m5_lasso         6669.743
## m6_bag           6774.290
## m7_rf            6733.078
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

Conclusion: The bagging model is still the best model.

Model 8: Gradient Boost

Tuning: We use Tuninig when we want to play a series of parameters to check different performance.

library("gbm")
## Loaded gbm 2.1.8
library("xgboost")
#Tuning
par_grid <- expand.grid(shrinkage = c(0.1 , 0.2, 0.3),
                        interaction_depth = c(2, 3, 4,5),
                        n_minobsinnode = c(9, 8, 10),
                        bag_fraction = c(0.9, 0.8, 1))

nrow(par_grid)
## [1] 108

We create all the modes and give them to the algorithm Gradient Boosting to check all the different modes to know which one works best. So we use Grid search.

#Grid search
for (i in 1:nrow(par_grid)) {
   set.seed(1234)
     # train model
  gbm_tune <- gbm(formula = Apps ~ .,
                  distribution = "gaussian",
                  data = train,
                  n.trees = 500,
                  interaction.depth = par_grid$interaction_depth[i],
                  shrinkage = par_grid$shrinkage[i],
                  n.minobsinnode = par_grid$n_minobsinnode[i],
                  bag.fraction = par_grid$bag_fraction[i],
                  train.fraction = 0.7,
                  cv.folds = 0,
                  n.cores = NULL, #will use all cores by default
                  verbose = FALSE)  
  #add min training error and trees to grid
  par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
  par_grid$min_RMSE[i]    <- sqrt(min(gbm_tune$valid.error))
}
head(par_grid)
##   shrinkage interaction_depth n_minobsinnode bag_fraction optimal_trees
## 1       0.1                 2              9          0.9           486
## 2       0.2                 2              9          0.9           335
## 3       0.3                 2              9          0.9           353
## 4       0.1                 3              9          0.9           491
## 5       0.2                 3              9          0.9           409
## 6       0.3                 3              9          0.9           183
##   min_RMSE
## 1 833.9233
## 2 850.1395
## 3 952.0971
## 4 834.3308
## 5 836.6362
## 6 887.2769
par_grid_s <- par_grid[order(par_grid$min_RMSE, decreasing = F),]
head(par_grid_s, 10)
##    shrinkage interaction_depth n_minobsinnode bag_fraction optimal_trees
## 17       0.2                 3              8          0.9           384
## 92       0.2                 4              8          1.0           409
## 87       0.3                 2              8          1.0            62
## 98       0.2                 2             10          1.0           472
## 97       0.1                 2             10          1.0           500
## 7        0.1                 4              9          0.9           496
## 55       0.1                 4              8          0.8           326
## 46       0.1                 5              9          0.8            97
## 86       0.2                 2              8          1.0           307
## 56       0.2                 4              8          0.8           129
##    min_RMSE
## 17 779.5410
## 92 784.7155
## 87 789.5404
## 98 793.9368
## 97 800.6561
## 7  801.3926
## 55 805.7055
## 46 806.1049
## 86 807.6660
## 56 808.1937

Conclusion: We select the best elements to build the prediction model in proportion to the least error.

This section tried several times to get the best elements.

Bulding prediction model
#Final model
m8_gbm <- gbm(formula = Apps ~ .,
                distribution = "gaussian",
                data = train,
                n.trees = 500,
                interaction.depth = 8,
                shrinkage = 0.1,
                n.minobsinnode = 5,
                bag.fraction = 0.9,
                train.fraction = 0.7,
                cv.folds = 0,
                n.cores = NULL, #will use all cores by default 
                ) 
summary(m8_gbm)

##                     var     rel.inf
## Accept           Accept 76.47997902
## Enroll           Enroll  8.83760384
## Top10perc     Top10perc  2.93986520
## Top25perc     Top25perc  2.78282312
## Grad.Rate     Grad.Rate  2.48602268
## Outstate       Outstate  1.61672541
## F.Undergrad F.Undergrad  1.40482916
## Books             Books  0.95526587
## Expend           Expend  0.82687152
## perc.alumni perc.alumni  0.46877540
## S.F.Ratio     S.F.Ratio  0.40716333
## Room.Board   Room.Board  0.28931028
## Personal       Personal  0.15782873
## PhD                 PhD  0.13528664
## P.Undergrad P.Undergrad  0.10710970
## Terminal       Terminal  0.07633852
## Private         Private  0.02820160

Note:Shows that Accept and Enroll have the most influence.

Test the model
pred_m8_gbm <- predict(m8_gbm, n.trees = 500, newdata = test)
abs_err_m8_gbm <- abs(pred_m8_gbm - test$Apps)
summary(abs_err_m8_gbm)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.801   70.684  170.540  439.953  436.964 7315.743
sd(abs_err_m8_gbm)
## [1] 808.7992
Comparisions of models
 df_err <- data.frame("m1_1"     = abs_err_m1_1,
                      "m3_1"     = abs_err_m3_1,
                      "m4_ridg" = abs_err_m4_ridge,
                      "m5_las" = abs_err_m5_lasso,
                      "m6_bag"   = abs_err_6_bagging,
                      "m7_rf"    = abs_err_m7_rf,
                      "m8_gbm"   = abs_err_m8_gbm)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##         mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1             596.9731            390.5073        773.7654        2.0986942
## m3_1             596.2298            398.2408        755.0726        0.2370317
## m4_ridg          637.8083            405.3886        814.0771        5.1078962
## m5_las           582.5965            354.4272        757.9307        2.7801727
## m6_bag           446.0091            168.2453        796.3811        0.4188667
## m7_rf            512.2881            222.2484        854.9199        2.1003333
## m8_gbm           439.9528            170.5401        808.7992        0.8006347
##         max.of.Abs.Error
## m1_1            6972.844
## m3_1            6604.283
## m4_ridg         6530.076
## m5_las          6669.743
## m6_bag          6774.290
## m7_rf           6733.078
## m8_gbm          7315.743

Conclusion: Model Gradient Boost offers the best performance among forecasting models.

#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")

Model 9: XGBoost Regression

Use Tuning to examine different modes.

x <- model.matrix(Apps ~ ., data = train)[, -1]
y <- train$Apps
#Tuning
par_grid <- expand.grid(eta = c(0.01, 0.1, 0.3),
                        lambda = c(2, 3, 4),
                        max_depth = c(3, 5, 7),
                        subsample = c(0.8, 0.9, 1), 
                        colsample_bytree = c(0.8, 0.9, 1))

nrow(par_grid)
## [1] 243

We create all the modes and give them to the algorithm Gradient Boosting to check all the different modes to know which one works best. So we use Grid search.

#Grid search
for(i in 1:nrow(par_grid )) {
  set.seed(123)
  
  #train model
  xgb_tune <- xgboost(
    data =  x,
    label = y,
    eta = par_grid$eta[i],
    max_depth = par_grid$max_depth[i],
    subsample = par_grid$subsample[i],
    colsample_bytree = par_grid$colsample_bytree[i],
    nrounds = 1000,
    objective = "reg:squarederror",  # for regression models
    verbose = 0,               # silent,
    early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
  )
  
  # add min training error and trees to grid
  par_grid$optimal_trees[i] <- which.min(xgb_tune$evaluation_log$train_rmse)
  par_grid$min_RMSE[i]      <- min(xgb_tune$evaluation_log$train_rmse)
}

head(par_grid)
##    eta lambda max_depth subsample colsample_bytree optimal_trees   min_RMSE
## 1 0.01      2         3       0.8              0.8          1000 294.815887
## 2 0.10      2         3       0.8              0.8          1000  12.256836
## 3 0.30      2         3       0.8              0.8          1000   0.051607
## 4 0.01      3         3       0.8              0.8          1000 294.815887
## 5 0.10      3         3       0.8              0.8          1000  12.256835
## 6 0.30      3         3       0.8              0.8          1000   0.051607
par_grid_s <- par_grid[order(par_grid$min_RMSE, decreasing = F),]
head(par_grid_s, 10)
##     eta lambda max_depth subsample colsample_bytree optimal_trees min_RMSE
## 21  0.3      2         7       0.8              0.8           297 0.000613
## 24  0.3      3         7       0.8              0.8           297 0.000613
## 27  0.3      4         7       0.8              0.8           297 0.000613
## 174 0.3      2         5       0.8              1.0           473 0.000641
## 177 0.3      3         5       0.8              1.0           473 0.000641
## 180 0.3      4         5       0.8              1.0           473 0.000641
## 210 0.3      2         7       0.9              1.0           261 0.000653
## 213 0.3      3         7       0.9              1.0           261 0.000653
## 216 0.3      4         7       0.9              1.0           261 0.000653
## 48  0.3      2         7       0.9              0.8           247 0.000662
Bulding prediction model
m9_xgb <- xgboost(data = x,
                  label = y,
                  eta = 0.3,
                  max_depth = 7,
                  lambda = 2,
                  nrounds = 1000,
                  colsample_bytree = 1,
                  subsample = 0.7,
                  objective = "reg:squarederror",
                  verbose = 0)
Test the model
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
pred_m9_xgb <- predict(m9_xgb, x_test)
abs_err_m9_xgb <- abs(pred_m9_xgb - test$Apps)
summary(abs_err_m9_xgb)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.914   55.511  164.383  459.185  454.833 5525.344
sd(abs_err_m9_xgb)
## [1] 801.3323
Comparisions of models
 df_err <- data.frame("m1_1"    = abs_err_m1_1,
                      "m3_1"    = abs_err_m3_1,
                      "m4_ridge" = abs_err_m4_ridge,
                      "m5_lasso"  = abs_err_m5_lasso,
                      "m6_bagging"  = abs_err_6_bagging,
                      "m7_rf"   = abs_err_m7_rf,
                      "m8_gbm"  = abs_err_m8_gbm,
                      "m9_xgb"  = abs_err_m9_xgb)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##            mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1                596.9731            390.5073        773.7654
## m3_1                596.2298            398.2408        755.0726
## m4_ridge            637.8083            405.3886        814.0771
## m5_lasso            582.5965            354.4272        757.9307
## m6_bagging          446.0091            168.2453        796.3811
## m7_rf               512.2881            222.2484        854.9199
## m8_gbm              439.9528            170.5401        808.7992
## m9_xgb              459.1854            164.3834        801.3323
##            min.of.Abs.Error max.of.Abs.Error
## m1_1              2.0986942         6972.844
## m3_1              0.2370317         6604.283
## m4_ridge          5.1078962         6530.076
## m5_lasso          2.7801727         6669.743
## m6_bagging        0.4188667         6774.290
## m7_rf             2.1003333         6733.078
## m8_gbm            0.8006347         7315.743
## m9_xgb            0.9140625         5525.344
#boxplot of absolute errors
boxplot(df_err[,1:4], main = "Abs.Error Dist. of models")

boxplot(df_err[,5:8], main = "Abs.Error Dist. of models")

Conclusion: So far, models XGBoost Regression and Gradient Boost are the best models that work very closely together.

Model 10: Artificial Neural Networks

Before building a neural network model, we need to scale the data.

#min_max normalization
college_data_ann <- college_data
college_data_ann$Private_b <- ifelse(college_data_ann$Private == "Yes",1,0)
college_data_ann <- college_data_ann[, -1]
head(college_data_ann, 22)
##    Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate
## 1  1660   1232    721        23        52        2885         537     7440
## 2  2186   1924    512        16        29        2683        1227    12280
## 3  1428   1097    336        22        50        1036          99    11250
## 4   417    349    137        60        89         510          63    12960
## 5   193    146     55        16        44         249         869     7560
## 6   587    479    158        38        62         678          41    13500
## 7   353    340    103        17        45         416         230    13290
## 8  1899   1720    489        37        68        1594          32    13868
## 9  1038    839    227        30        63         973         306    15595
## 10  582    498    172        21        44         799          78    10468
## 11 1732   1425    472        37        75        1830         110    16548
## 12 2652   1900    484        44        77        1707          44    17080
## 13 1179    780    290        38        64        1130         638     9690
## 14 1267   1080    385        44        73        1306          28    12572
## 15  494    313    157        23        46        1317        1235     8352
## 16 1420   1093    220         9        22        1018         287     8700
## 17 4302    992    418        83        96        1593           5    19760
## 18 1216    908    423        19        40        1819         281    10100
## 19 1130    704    322        14        23        1586         326     9996
## 20 3540   2001   1016        24        54        4190        1512     5130
## 21  713    661    252        25        44         712          23    15476
## 22 7313   4664   1910        20        63        9940        1035     6806
##    Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 1        3300   450     2200  70       78      18.1          12   7041
## 2        6450   750     1500  29       30      12.2          16  10527
## 3        3750   400     1165  53       66      12.9          30   8735
## 4        5450   450      875  92       97       7.7          37  19016
## 5        4120   800     1500  76       72      11.9           2  10922
## 6        3335   500      675  67       73       9.4          11   9727
## 7        5720   500     1500  90       93      11.5          26   8861
## 8        4826   450      850  89      100      13.7          37  11487
## 9        4400   300      500  79       84      11.3          23  11644
## 10       3380   660     1800  40       41      11.5          15   8991
## 11       5406   500      600  82       88      11.3          31  10932
## 12       4440   400      600  73       91       9.9          41  11711
## 13       4785   600     1000  60       84      13.3          21   7940
## 14       4552   400      400  79       87      15.3          32   9305
## 15       3640   650     2449  36       69      11.1          26   8127
## 16       4780   450     1400  78       84      14.7          19   7355
## 17       5300   660     1598  93       98       8.4          63  21424
## 18       3520   550     1100  48       61      12.1          14   7994
## 19       3090   900     1320  62       66      11.5          18  10908
## 20       3592   500     2000  60       62      23.1           5   4010
## 21       3336   400     1100  69       82      11.3          35  42926
## 22       2540    96     2000  83       96      18.3          14   5854
##    Grad.Rate Private_b
## 1         60         1
## 2         56         1
## 3         54         1
## 4         59         1
## 5         15         1
## 6         55         1
## 7         63         1
## 8         73         1
## 9         80         1
## 10        52         1
## 11        73         1
## 12        76         1
## 13        74         1
## 14        68         1
## 15        55         1
## 16        69         1
## 17       100         1
## 18        59         1
## 19        46         1
## 20        34         0
## 21        48         1
## 22        70         0
max <- apply(college_data_ann, 2, max)
min <- apply(college_data_ann, 2, min)
scaled_college_data <- as.data.frame(scale(college_data_ann, center = min, scale = max - min))
head(scaled_college_data, 22)
##           Apps      Accept      Enroll  Top10perc Top25perc F.Undergrad
## 1  0.072687934 0.062125107 0.107912537 0.23157895 0.4725275 0.087163535
## 2  0.096901901 0.099185947 0.075035394 0.15789474 0.2197802 0.080751651
## 3  0.062008010 0.054895030 0.047349379 0.22105263 0.4505495 0.028472575
## 4  0.015467477 0.014835047 0.016045304 0.62105263 0.8791209 0.011776282
## 5  0.005155826 0.003963153 0.003146138 0.15789474 0.3846154 0.003491620
## 6  0.023293284 0.021797344 0.019348749 0.38947368 0.5824176 0.017108939
## 7  0.012521291 0.014353042 0.010696870 0.16842105 0.3956044 0.008792534
## 8  0.083690098 0.088260497 0.071417335 0.37894737 0.6483516 0.046184611
## 9  0.044054689 0.041077549 0.030202926 0.30526316 0.5934066 0.026472829
## 10 0.023063113 0.022814910 0.021551046 0.21052632 0.3846154 0.020949721
## 11 0.076002394 0.072461440 0.068743118 0.37894737 0.7252747 0.053675724
## 12 0.118353819 0.097900600 0.070630801 0.45263158 0.7472527 0.049771458
## 13 0.050545505 0.037917738 0.040113261 0.38947368 0.6043956 0.031456323
## 14 0.054596511 0.053984576 0.055057417 0.45263158 0.7032967 0.037042915
## 15 0.019012107 0.012907027 0.019191443 0.23157895 0.4065934 0.037392077
## 16 0.061639737 0.054680805 0.029101778 0.08421053 0.1428571 0.027901219
## 17 0.194310178 0.049271637 0.060248545 0.86315789 0.9560440 0.046152869
## 18 0.052248769 0.044772922 0.061035079 0.18947368 0.3406593 0.053326562
## 19 0.048289831 0.033847472 0.045147082 0.13684211 0.1538462 0.045930675
## 20 0.159232150 0.103309769 0.154318075 0.24210526 0.4945055 0.128586846
## 21 0.029093587 0.031544559 0.034135599 0.25263158 0.3846154 0.018188167
## 22 0.332919026 0.245929734 0.294950448 0.20000000 0.5934066 0.311103352
##     P.Undergrad  Outstate Room.Board      Books   Personal       PhD   Terminal
## 1  0.0245477444 0.2634298  0.2395965 0.15775401 0.29770992 0.6739130 0.71052632
## 2  0.0561483856 0.5134298  0.7361286 0.29144385 0.19083969 0.2282609 0.07894737
## 3  0.0044882070 0.4602273  0.3105296 0.13547237 0.13969466 0.4891304 0.55263158
## 4  0.0028394779 0.5485537  0.5784994 0.15775401 0.09541985 0.9130435 0.96052632
## 5  0.0397526906 0.2696281  0.3688525 0.31372549 0.19083969 0.7391304 0.63157895
## 6  0.0018319212 0.5764463  0.2451135 0.18003565 0.06488550 0.6413043 0.64473684
## 7  0.0104877490 0.5655992  0.6210593 0.18003565 0.19083969 0.8913043 0.90789474
## 8  0.0014197390 0.5954545  0.4801387 0.15775401 0.09160305 0.8804348 1.00000000
## 9  0.0139683994 0.6846591  0.4129887 0.09090909 0.03816794 0.7717391 0.78947368
## 10 0.0035264484 0.4198347  0.2522068 0.25133690 0.23664122 0.3478261 0.22368421
## 11 0.0049919853 0.7338843  0.5715637 0.18003565 0.05343511 0.8043478 0.84210526
## 12 0.0019693153 0.7613636  0.4192938 0.13547237 0.05343511 0.7065217 0.88157895
## 13 0.0291733455 0.3796488  0.4736759 0.22459893 0.11450382 0.5652174 0.78947368
## 14 0.0012365468 0.5285124  0.4369483 0.13547237 0.02290076 0.7717391 0.82894737
## 15 0.0565147699 0.3105372  0.2931904 0.24688057 0.33572519 0.3043478 0.59210526
## 16 0.0130982368 0.3285124  0.4728878 0.15775401 0.17557252 0.7608696 0.78947368
## 17 0.0001831921 0.8997934  0.5548550 0.25133690 0.20580153 0.9239130 0.97368421
## 18 0.0128234486 0.4008264  0.2742749 0.20231729 0.12977099 0.4347826 0.48684211
## 19 0.0148843600 0.3954545  0.2064943 0.35828877 0.16335878 0.5869565 0.55263158
## 20 0.0692008244 0.1441116  0.2856242 0.18003565 0.26717557 0.5652174 0.50000000
## 21 0.0010075567 0.6785124  0.2452711 0.13547237 0.12977099 0.6630435 0.76315789
## 22 0.0473551637 0.2306818  0.1197982 0.00000000 0.26717557 0.8152174 0.94736842
##    S.F.Ratio perc.alumni     Expend  Grad.Rate Private_b
## 1  0.4182306    0.187500 0.07267140 0.55555556         1
## 2  0.2600536    0.250000 0.13838671 0.51111111         1
## 3  0.2788204    0.468750 0.10460535 0.48888889         1
## 4  0.1394102    0.578125 0.29841461 0.54444444         1
## 5  0.2520107    0.031250 0.14583294 0.05555556         1
## 6  0.1849866    0.171875 0.12330575 0.50000000         1
## 7  0.2412869    0.406250 0.10698060 0.58888889         1
## 8  0.3002681    0.578125 0.15648387 0.70000000         1
## 9  0.2359249    0.359375 0.15944351 0.77777778         1
## 10 0.2412869    0.234375 0.10943126 0.46666667         1
## 11 0.2359249    0.484375 0.14602145 0.70000000         1
## 12 0.1983914    0.640625 0.16070654 0.73333333         1
## 13 0.2895442    0.328125 0.08961864 0.71111111         1
## 14 0.3431635    0.500000 0.11535054 0.64444444         1
## 15 0.2305630    0.406250 0.09314382 0.50000000         1
## 16 0.3270777    0.296875 0.07859068 0.65555556         1
## 17 0.1581769    0.984375 0.34380832 1.00000000         1
## 18 0.2573727    0.218750 0.09063661 0.54444444         1
## 19 0.2412869    0.281250 0.14556902 0.40000000         1
## 20 0.5522788    0.078125 0.01553339 0.26666667         0
## 21 0.2359249    0.546875 0.74914698 0.42222222         1
## 22 0.4235925    0.218750 0.05029502 0.66666667         0
hist(scaled_college_data$Apps)

We convert the new data into two parts Train and Test.

#Divide Dataset into Train and Test
set.seed(12)
train_cases <- sample(1:nrow(scaled_college_data), nrow(scaled_college_data) * 0.7)
train_ann <- scaled_college_data[train_cases, ]
test_ann <- scaled_college_data[-train_cases, ]
dim(train_ann)
## [1] 541  18
dim(test_ann)
## [1] 233  18
summary(train_ann)
##       Apps               Accept             Enroll          Top10perc     
##  Min.   :0.0008746   Min.   :0.000964   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0319937   1st Qu.:0.028010   1st Qu.:0.03256   1st Qu.:0.1474  
##  Median :0.0706624   Median :0.058216   Median :0.06339   Median :0.2421  
##  Mean   :0.1389433   Mean   :0.107262   Mean   :0.12192   Mean   :0.2869  
##  3rd Qu.:0.1747457   3rd Qu.:0.135551   3rd Qu.:0.14299   3rd Qu.:0.3684  
##  Max.   :0.9257929   Max.   :0.804627   Max.   :1.00000   Max.   :0.9895  
##    Top25perc       F.Undergrad        P.Undergrad          Outstate     
##  Min.   :0.0000   Min.   :0.004539   Min.   :0.000000   Min.   :0.0124  
##  1st Qu.:0.3407   1st Qu.:0.028028   1st Qu.:0.004305   1st Qu.:0.2526  
##  Median :0.4945   Median :0.051359   Median :0.016121   Median :0.3931  
##  Mean   :0.5188   Mean   :0.117147   Mean   :0.038893   Mean   :0.4175  
##  3rd Qu.:0.6703   3rd Qu.:0.142331   3rd Qu.:0.044928   3rd Qu.:0.5480  
##  Max.   :1.0000   Max.   :1.000000   Max.   :1.000000   Max.   :1.0000  
##    Room.Board          Books             Personal             PhD        
##  Min.   :0.01576   Min.   :0.006239   Min.   :0.007634   Min.   :0.0000  
##  1st Qu.:0.28358   1st Qu.:0.165775   1st Qu.:0.099237   1st Qu.:0.5870  
##  Median :0.37831   Median :0.180036   Median :0.145038   Median :0.7283  
##  Mean   :0.40646   Mean   :0.203740   Mean   :0.168025   Mean   :0.6994  
##  3rd Qu.:0.51734   3rd Qu.:0.224599   3rd Qu.:0.221374   3rd Qu.:0.8370  
##  Max.   :1.00000   Max.   :1.000000   Max.   :0.711908   Max.   :1.0000  
##     Terminal        S.F.Ratio       perc.alumni         Expend       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.6053   1st Qu.:0.2413   1st Qu.:0.2188   1st Qu.:0.06786  
##  Median :0.7500   Median :0.3003   Median :0.3281   Median :0.09693  
##  Mean   :0.7293   Mean   :0.3122   Mean   :0.3589   Mean   :0.12124  
##  3rd Qu.:0.8947   3rd Qu.:0.3780   3rd Qu.:0.4844   3rd Qu.:0.14474  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :0.80148  
##    Grad.Rate        Private_b     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.4889   1st Qu.:0.0000  
##  Median :0.6111   Median :1.0000  
##  Mean   :0.6212   Mean   :0.7264  
##  3rd Qu.:0.7556   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000
summary(test_ann)
##       Apps             Accept            Enroll          Top10perc     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00173   Min.   :0.0000  
##  1st Qu.:0.03241   1st Qu.:0.02999   1st Qu.:0.03256   1st Qu.:0.1474  
##  Median :0.06274   Median :0.05431   Median :0.06009   Median :0.2211  
##  Mean   :0.11543   Mean   :0.09174   Mean   :0.10422   Mean   :0.2633  
##  3rd Qu.:0.13773   3rd Qu.:0.10551   3rd Qu.:0.11656   3rd Qu.:0.3158  
##  Max.   :1.00000   Max.   :1.00000   Max.   :0.91851   Max.   :1.0000  
##    Top25perc        F.Undergrad       P.Undergrad           Outstate     
##  Min.   :0.04396   Min.   :0.00000   Min.   :0.0000916   Min.   :0.0000  
##  1st Qu.:0.37363   1st Qu.:0.02552   1st Qu.:0.0043050   1st Qu.:0.2696  
##  Median :0.49451   Median :0.04815   Median :0.0162125   Median :0.4069  
##  Mean   :0.50403   Mean   :0.10177   Mean   :0.0394113   Mean   :0.4227  
##  3rd Qu.:0.62637   3rd Qu.:0.10665   3rd Qu.:0.0398443   3rd Qu.:0.5429  
##  Max.   :1.00000   Max.   :0.94839   Max.   :0.5019922   Max.   :0.9174  
##    Room.Board         Books           Personal            PhD         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.02174  
##  1st Qu.:0.2900   1st Qu.:0.1667   1st Qu.:0.08397   1st Qu.:0.58696  
##  Median :0.3925   Median :0.1800   Median :0.14504   Median :0.73913  
##  Mean   :0.4062   Mean   :0.1976   Mean   :0.16356   Mean   :0.71091  
##  3rd Qu.:0.5147   3rd Qu.:0.2246   3rd Qu.:0.21756   3rd Qu.:0.85870  
##  Max.   :0.8149   Max.   :0.5811   Max.   :1.00000   Max.   :1.00000  
##     Terminal        S.F.Ratio        perc.alumni         Expend        
##  Min.   :0.1184   Min.   :0.02145   Min.   :0.0000   Min.   :0.005542  
##  1st Qu.:0.6316   1st Qu.:0.23861   1st Qu.:0.1719   1st Qu.:0.067073  
##  Median :0.8026   Median :0.29491   Median :0.3125   Median :0.100911  
##  Mean   :0.7420   Mean   :0.30624   Mean   :0.3480   Mean   :0.124271  
##  3rd Qu.:0.8816   3rd Qu.:0.35925   3rd Qu.:0.4844   3rd Qu.:0.143137  
##  Max.   :1.0000   Max.   :0.67828   Max.   :0.9375   Max.   :1.000000  
##    Grad.Rate         Private_b     
##  Min.   :0.05556   Min.   :0.0000  
##  1st Qu.:0.46667   1st Qu.:0.0000  
##  Median :0.61111   Median :1.0000  
##  Mean   :0.60291   Mean   :0.7339  
##  3rd Qu.:0.74444   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000

Note: train and test data distribution is close together which is good.

Bulding prediction model
library("neuralnet")
## Warning: package 'neuralnet' was built under R version 4.0.5
model_nn <- neuralnet(Apps ~ ., data = train_ann, hidden = c(3, 5), act.fct = "logistic", linear.output = T)
plot(model_nn, col.hidden = "dark green", col.hidden.synapse = "dark green", fill = "light blue")
Test the model
test_pred_nn <- compute(model_nn, test_ann[, 2:18])
y_pred_nn <- (test_pred_nn$net.result[,1] * (max(college_data_ann$Apps) - min(college_data_ann$Apps))) + min(college_data_ann$Apps)
y_test_nn <- (test_ann$Apps * (max(college_data_ann$Apps) - min(college_data_ann$Apps))) + min(college_data_ann$Apps)

plot(y_test_nn, y_pred_nn, col ='blue', pch = 16, ylab = "predicted values", xlab = "real values")
abline(0, 1, col = "red")

plot(y_test_nn, col = "red", type = "l")
lines(y_pred_nn, col = "blue")

#Calculate Root Mean Square Error (RMSE)
rmse_nn <- (sum((y_test_nn - y_pred_nn) ^ 2) / length(y_test_nn)) ^ 0.5
rmse_nn
## [1] 771.2675
abs_err_ann <- abs(y_pred_nn - y_test_nn)
summary(abs_err_ann)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.656   86.213  207.209  416.223  453.307 5334.065
sd(abs_err_ann)
## [1] 650.7141
 df_err <- data.frame("m1_1"    = abs_err_m1_1,
                      "m3_1"    = abs_err_m3_1,
                      "m4_ridge" = abs_err_m4_ridge,
                      "m5_lasso"  = abs_err_m5_lasso,
                      "m6_bagging"  = abs_err_6_bagging,
                      "m7_rf"   = abs_err_m7_rf,
                      "m8_gbm"  = abs_err_m8_gbm,
                      "m9_xgb"  = abs_err_m9_xgb,
                      "m10_ann" = abs_err_ann)

 models_comp <- data.frame("mean of Abs Error"   = apply(df_err, 2, mean  ),
                           "median of Abs Error" = apply(df_err, 2, median),
                           "SD of Abs Error"     = apply(df_err, 2, sd    ),
                           "min of Abs Error"    = apply(df_err, 2, min   ),
                           "max of Abs Error"    = apply(df_err, 2, max)  )    
 models_comp
##            mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1                596.9731            390.5073        773.7654
## m3_1                596.2298            398.2408        755.0726
## m4_ridge            637.8083            405.3886        814.0771
## m5_lasso            582.5965            354.4272        757.9307
## m6_bagging          446.0091            168.2453        796.3811
## m7_rf               512.2881            222.2484        854.9199
## m8_gbm              439.9528            170.5401        808.7992
## m9_xgb              459.1854            164.3834        801.3323
## m10_ann             416.2235            207.2094        650.7141
##            min.of.Abs.Error max.of.Abs.Error
## m1_1              2.0986942         6972.844
## m3_1              0.2370317         6604.283
## m4_ridge          5.1078962         6530.076
## m5_lasso          2.7801727         6669.743
## m6_bagging        0.4188667         6774.290
## m7_rf             2.1003333         6733.078
## m8_gbm            0.8006347         7315.743
## m9_xgb            0.9140625         5525.344
## m10_ann           2.6555347         5334.065

Conclusion:The three models XGBoost Regression, Gradient Boost and Artificial Neural Networks work almost closely together, but model Artificial Neural Networks performs best.

boxplot(df_err[,1:5], main = "Abs.Error Dist. of models")

boxplot(df_err[,6:9], main = "Abs.Error Dist. of models")

Questions of Modeling

  1. On training data, build at least 6 models to predict the response variable. Report actions and results.
  2. Compare the outputs of the models using the indicators presented in the class. Report actions and results.

Answer:In this section, 10 forecast models were made and their reports were brought and we compared them. As a result, the best model was Model Artificial Neural Networks.

5.Evaluation

In the final evaluation, we compare the results of the created prediction models to select the best model for deployment.

par(mfrow = c(3, 3))
for (i in 1:9) {
   hist(df_err[,i], xlab ="", main = colnames(df_err)[i],probability = T)
   lines(density(df_err[,i]), col = "red")
}

par(mfrow = c(1, 5))
 for (i in 1:5) {
      boxplot(df_err[, i], xlab = "", main =  colnames(df_err)[i] )
 }

 for (i in 6:9) {
      boxplot(df_err[, i], xlab = "", main =  colnames(df_err)[i] )
 }

Conclusion: Due to the errors created, it seems that the best model for deploying model Artificial Neural Networks.

Questions of Evaluation

  1. The proposed models are evaluated on experimental data using common indicators in machine learning. Report actions and results.
  2. What suggestions do you have for testing the results in a real environment?

Answer: The models were tested and the best model was Artificial Neural Networks. Since this was my first data analysis project and I was not fully connected to the actual project environment, I could not have a good answer to question 2.

6.Deployment

Model Artificial Neural Networks became our best model, so we use this model for forecasting.

Questions of Deployment

  1. Examine the challenges of algorithm development.
  2. What solutions do you have to solve them?
  3. What requirements do you need to provide those solutions?

The algorithm may not work well in real space, so it is better to check more models.

7.Conclusion

Questions of Conclusion

  1. What did learning this project teach you?
  2. What challenges did you face? How did you solve them?

Answer: In this project, I got acquainted with different parts of a data science project. I was able to evaluate and test myself in accordance with what I learned in this course.
I created various regression prediction models and worked on them to report the best.
Finally, I think this course has been one of the best courses of my studies.

Thank you for all the hard work you put into us this year.